A deep reinforcement learning algorithm for the rectangular strip packing problem

As a branch of the two-dimensional (2D) optimal blanking problem, rectangular strip packing is a typical NP-hard problem. Classical packing methods rely on heuristic and metaheuristic algorithms, which usually require manually designed decision rules to guide the search, resulting in small solvable problem sizes, weak generalization, and low solution efficiency. Inspired by deep learning and reinforcement learning, and taking the characteristics of rectangular piece packing into account, this work proposes a novel deep reinforcement learning algorithm for the rectangular strip packing problem. A pointer network with an encoder-decoder structure serves as the base network. A model-free reinforcement learning algorithm trains the network parameters to optimize the packing sequence. This design avoids designing heuristic rules separately for different problems and uses the self-learning capacity of deep networks to solve a wider range of instances. In addition, a piece positioning algorithm based on maximum rectangles bottom-left (Maxrects-BL) determines the placement of pieces on the plate and computes the model rewards and packing parameters. Finally, instances are used to analyze the optimization effect of the algorithm. The experimental results show that the proposed algorithm produces three better and five comparable results compared with some classical heuristic algorithms. In addition, the calculation time of the proposed algorithm is less than 1 second in all test instances, which indicates good generalization, solution efficiency, and practical application potential.


Introduction
The packing problem, also known as the nesting problem or the 2D bin packing problem (2D-BPP), is a well-known combinatorial optimization problem in operations research and computer science. It arises widely in the raw material cutting stage of steel plate cutting, glass cutting, leather cutting, furniture manufacturing, and other industries [1]. The main goal of packing optimization is to improve the utilization rate of raw materials so that production costs are reduced and economic benefits are improved. Among the classical packing problems, the rectangular packing problem is the basis of 2D packing and appears in many fields of modern manufacturing. It is a typical NP-hard problem for which the running time of known exact methods grows exponentially with the number of pieces [2]. Therefore, no polynomial-time exact algorithm is known. Over time, various polynomial-time approximation schemes (PTAS) have been developed, which solve these problems approximately with a bounded error [3][4][5]. At present, the classic packing solution is a hybrid algorithm [6] that combines intelligent algorithms (such as ant colony optimization (ACO) [7], particle swarm optimization (PSO) [8], and the genetic algorithm (GA) [9]) with heuristic algorithms (such as the bottom-left (BL) algorithm [10] and the lowest horizontal line algorithm [11]). Intelligent algorithms are commonly used for sequencing optimization (to determine the packing order), and heuristic algorithms are used for positioning (to determine the placement angle and position). For example, Valvo et al. [12] solved the rectangular strip packing problem by combining metaheuristic algorithms with the no-fit polygon algorithm. Vasilyev I et al. [13] proposed fast heuristic algorithms for the two-dimensional multiple strip packing problem (MSPP).
In recent years, heuristic algorithms, as a classical computing method, have been improved in many fields. AlRassas AM et al. [14] proposed a new hybrid intelligent time series model to forecast oil production, developed by applying a new optimization algorithm called the Aquila optimizer (AO). Jouhari H et al. [15] solved machine scheduling problems using an improved Harris hawks optimizer (HHO). The algorithm used the salp swarm algorithm to search the HHO, which improved performance and reduced the calculation time. Makhadmeh SN et al. [16] improved and adjusted the coronavirus herd immunity optimizer (CHIO) and applied it to the discrete power scheduling problem. P Wang et al. [17] redesigned the search and attack operators of the gray wolf algorithm and developed a new discrete gray wolf optimization algorithm for the 2D rectangular strip packing problem. Ding R et al. [18] first proposed an improved ant colony optimization algorithm for the extensible bin packing problem, improving convergence by changing the pheromone update method and adjusting parameters adaptively.
However, the intelligent algorithm has the risk of falling into a local optimum when searching, and the search time continues to extend with the increase of the problem scale. In addition, heuristic algorithms usually require decision-making for different problems to guide the solution process, which requires extensive domain and engineering knowledge. Different packing instances are usually solved separately, while the underlying patterns that these cases may share are ignored, resulting in the limitation of generality and the reduction of solving efficiency.
In recent years, artificial intelligence, especially deep learning (DL) and reinforcement learning (RL), has achieved remarkable results in many fields [19]. Many researchers have begun to use deep reinforcement learning (DRL) [20,21] to solve combinatorial optimization problems, especially in scheduling [22,23] and path optimization [24][25][26][27][28][29], showing the great potential of DRL for combinatorial optimization. Progress in sequence-to-sequence models [30] has stimulated research on neural combinatorial optimization. The first modern deep model used for combinatorial optimization problems is PtrNet, proposed by Vinyals O et al. [31]. That study proposed a neural network architecture with a specific attention mechanism and used supervised learning to solve the traveling salesman problem (TSP). Bello I et al. [32] trained PtrNet with RL to avoid the need for ground-truth labels, which would require a large number of optimal solutions. This made it possible to obtain optimal or near-optimal solutions for large-scale TSP and knapsack problems while saving computational cost. Inspired by the success of DRL in path and scheduling optimization, some scholars have studied the 3D bin packing problem. Hu H et al. [19] optimized the packing sequence of 3D bin packing with DRL, combined with a heuristic algorithm to optimize the surface area of bins. Lu Duan et al. [33] extended the method of [19] by designing a multitask selective learning (MTSL) framework, in which reinforcement learning and supervised learning were used to learn the sequence and orientation of 3D bin packing. Yuan Jiang et al. [34] solved the 3D bin packing problem through DRL and constraint programming and produced excellent solutions for large-scale instances.
However, rectangular packing differs considerably from 3D bin packing in solving operations, model building, and application scenarios. There is almost no research on directly using DRL to solve 2D piece packing. In recent years, some researchers have tried to solve the problem with machine learning methods adapted to the characteristics of 2D piece packing, with some success. S Bohm et al. [35] compared heuristic algorithms, constraint optimization, and reinforcement learning on the 2D packing problem of wood allocation in the household industry. They found that the greedy heuristic and constraint optimization gave better results but took a long time, and that the DRL method was not suitable for this instance. Hao Zhang [36] predicted the blanking results of new order tasks by using the RandomForest and XGBoost machine learning models, which helped enterprises to preschedule production and estimate costs. Xiaofei Xu et al. [1] tried to solve the rectangular packing problem by using transfer ant colony reinforcement learning. With the help of the "trial and error" learning mode, an ant colony with self-learning ability acquired and updated the knowledge stored in a knowledge matrix. J Fang et al. [37,38] and X Zhao et al. [39] used the Monte Carlo (MC) algorithm and the Q-learning algorithm from reinforcement learning to solve the sequence optimization problem in 2D irregular piece packing and rectangular packing, respectively, but the search had a certain randomness. The combination of machine learning and intelligent algorithms has also produced results in other fields. Zivkovic M et al. [40] developed a hybrid version of the arithmetic optimization algorithm (AOA), which was used to tune XGBoost hyperparameters for COVID-19 chest X-ray images and obtained better results than other cutting-edge metaheuristic algorithms. Alqahtani et al. [41] proposed a method based on deep learning networks to optimize an intelligent network intrusion detection system and used the long short-term memory (LSTM) approach to predict the results. Bacanin N et al. [42] combined machine learning models with an enhanced sine cosine swarm intelligence algorithm to train logistic regression and tune XGBoost models. This research addresses shortcomings of the existing technology and has been applied successfully to spam filtering. K. Venkatachalam et al. [43] used the state-of-the-art long short-term memory model and the transductive long short-term memory model to improve the accuracy of weather forecasting, with the loss and the mean absolute error used to evaluate the results.
In the latest packing research, most machine learning methods for the 2D packing problem use the self-learning rule of classic reinforcement learning, whose search has a certain randomness and instability. Moreover, different packing rules need to be designed separately for these search-based algorithms, and the underlying patterns shared by different packing instances are not considered, which results in weak universality, small solvable problem sizes, and high time cost. The latest literature review on intelligent packing [44] suggests that machine learning and deep learning algorithms may be able to better optimize the packing sequence and obtain a good optimization effect, but this research direction remains largely unexplored. Since the pointer network constructs solutions in an autoregressive way by using the attention mechanism [45], it is suitable for solving sequential combinatorial optimization problems. Research on DRL-based packing algorithms is not only an early attempt to apply deep networks to the packing problem but also builds on and extends existing research findings. The development of this novel neural-network-based algorithm can provide a new idea for the online solution of packing problems and a new reference for solving other combinatorial optimization problems, which has theoretical significance and application potential.
In this paper, a DRL algorithm is proposed for the first time to solve the rectangular strip packing problem. A heuristic algorithm based on maximal rectangles bottom-left (Maxrects-BL) is designed for positioning optimization of piece packing. In addition, an advanced deep architecture with an encoder-decoder structure, called a pointer network, is used as the policy network. RL training on a large number of rectangular piece packing samples exploits their underlying patterns to learn the packing rules automatically. Finally, an instance-based packing test is performed with the trained model, which demonstrates the effectiveness of the algorithm. The proposed algorithm avoids designing heuristic rules for different instances separately and uses a deep network with self-learning characteristics to solve a wider range of instances, which gives it good universality and solution efficiency. The rest of this paper is organized as follows. First, 2D packing problems are summarized. Second, the 2D rectangular strip packing problem is modeled and described. Third, the positioning strategy based on the Maxrects-BL algorithm is introduced, and the method of sequence optimization based on DRL is explained. Finally, the experimental parameters and experimental results are analyzed and summarized.

Problem statement for rectangular packing
Due to the requirements of the production process and the limitations of processing conditions, the optimal blanking problem in different fields corresponds to different mathematical models [46]. According to the classification rules of Wäscher et al. [47], the research object of this paper is the 2D rectangular piece strip packing problem (2DR-SP) with sufficient raw materials, in which the piece is the object to be arranged and the plate is the 2D carrier on which the pieces are arranged. The rectangular strip packing problem can be described as follows. Suppose the set of $n$ rectangular pieces to be packed is $\{P_1, P_2, P_3, \ldots, P_n\}$, and the piece numbers follow an ordered natural sequence. The width and height of the $i$-th rectangle are $w_i$ and $h_i$, respectively, so the rectangular piece $P_i$ can be expressed as $(w_i, h_i)$ $(i = 1, 2, 3, \ldots, n)$. The width of the rectangular plate is fixed at $W$, and its height is not limited. After the rectangular pieces are arranged, the height of the plate corresponding to the highest contour line is denoted by $H$, and the utilization rate of the plate is denoted by $U$.
PLOS ONE

The 2DR-SP refers to packing $n$ rectangular pieces onto a rectangular plate of fixed width in a certain sequence so as to maximize the utilization of the plate. The following constraints must be met during the packing process: 1) rectangular pieces do not exceed the plate boundary; 2) there is no overlap between any rectangular pieces; 3) a rectangular piece is allowed to rotate by 90 degrees; and 4) the edges of each rectangular piece are parallel to the plate boundary. The mathematical model of the rectangular strip packing problem draws in part on the research ideas of Korf R E et al. [48]. The objective function of the piece packing is shown in Formula (1), and the constraints are shown in Formula (2). Further, a 2D Cartesian coordinate system is established on the plate, as shown in Fig 1. The lower left vertex of the rectangular plate is defined as the coordinate origin $O(0, 0)$, and the width $W$ and height $H$ of the rectangular plate lie along the X axis and the Y axis of the coordinate system, respectively. A small rectangular piece $P_i$ $(i = 1, 2, 3, \ldots, n)$ is then placed within the plate boundary, with its width and height represented by $w_i$ and $h_i$, respectively. The abscissa and ordinate of the lower left corner of the piece after packing are represented by $x_i$ and $y_i$, respectively.
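The quantities defined above can be made concrete with a minimal Python sketch. It assumes that the utilization rate $U$ is the total piece area divided by $W \cdot H$, which is the usual definition but is not spelled out in this text, and it checks only constraints 1) and 2). The names `Placed`, `is_feasible`, `packing_height`, and `utilization` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Placed:
    """A packed piece: lower-left corner (x, y) and its width/height as packed."""
    x: float
    y: float
    w: float
    h: float


def overlaps(a: Placed, b: Placed) -> bool:
    """True if the interiors of a and b intersect (touching edges are allowed)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def is_feasible(layout: list[Placed], W: float) -> bool:
    """Constraints 1) and 2): inside the strip and pairwise non-overlapping."""
    inside = all(p.x >= 0 and p.y >= 0 and p.x + p.w <= W for p in layout)
    return inside and not any(overlaps(a, b) for a, b in combinations(layout, 2))


def packing_height(layout: list[Placed]) -> float:
    """H: height of the highest contour line of the layout."""
    return max(p.y + p.h for p in layout)


def utilization(layout: list[Placed], W: float) -> float:
    """U: total piece area divided by the used plate area W * H (assumed definition)."""
    used = sum(p.w * p.h for p in layout)
    return used / (W * packing_height(layout))


layout = [Placed(0, 0, 2, 3), Placed(2, 0, 2, 1), Placed(2, 1, 2, 2)]
print(is_feasible(layout, W=4), packing_height(layout), utilization(layout, W=4))
```

Because $W$ is fixed, maximizing $U$ in this sketch is equivalent to minimizing the packing height $H$.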

Positioning strategy based on Maxrects-BL
Piece positioning optimization refers to determining the placement position and angle of pieces on the plate once the packing sequence is given. As noted above, the piece positioning algorithm is a heuristic algorithm. The BL algorithm, proposed by Baker et al. [49] in the early 1980s, is one of the most classic heuristic algorithms, with simple rules, $O(n^2)$ time complexity, and few parameters. The basic flow of the BL algorithm is shown in Algorithm I. However, in BL-based packing, large rectangles may block the movement of rectangles that are not yet placed, which easily leaves large wasted holes. To overcome this shortcoming, this paper extends the BL algorithm to the maximum rectangles bottom-left (Maxrects-BL) algorithm as the positioning strategy for rectangular packing. The positioning strategy is not only a key step in forming the packing results but also an important part of the RL training of the network parameters.
Algorithm Ⅰ: The BL algorithm
1 Input: A rectangular piece sequence $(P_1, P_2, P_3, \ldots, P_n)$ and plate information
2 Place piece $P_1$ at the plate origin $O(0, 0)$
3 Under the conditions of Formula (2), move the piece $P_i$ vertically downward from the upper right corner of the plate, and then move it horizontally to the left until the piece $P_i$ touches the boundary of the plate or the contour of another piece
4 Repeat the process in Step 3 until all the pieces are packed into the plate
5 Output: Layout and plate utilization rate

The maximum rectangles (Maxrects) algorithm is, in a sense, an extension and improvement of the guillotine split placement rule [50]. Both store a list of free rectangles representing the free area of the plate. The difference is that the Maxrects algorithm performs the horizontal and vertical splits at the same time, which is more efficient; the splitting process is shown in Fig 2. When a rectangular piece $P$ is placed on a rectangular plate $F$, the remaining L-shaped area $F \setminus P$ can be divided into rectangles $F_1$ and $F_2$, and a new plate list $L$ can be obtained. $L = \{F_1, F_2, \ldots, F_n\}$ is the set of maximum free rectangles, which represents the remaining free area of the plate at a given packing step of the Maxrects algorithm. This ensures that more valid positions are considered when placing rectangular pieces. Each time a piece $P$ is packed into plate $F$, all other rectangles $F_j \in L$ with $P \cap F_j \neq \emptyset$ are checked and updated. After this step, non-maximal rectangles may be left in $L$, so a free rectangle $F_i \in L$ must be removed if there is another rectangle $F_j \in L$, $i \neq j$, for which $F_i \subseteq F_j$. The Maxrects algorithm is given in Algorithm II.

Algorithm Ⅱ: [...]
4 Decide the free rectangle $F_i$ to pack the rectangle $P$ into
5 Decide the orientation for the rectangle and place it at the bottom-left of $F_i$
6 Denote by $B$ the bounding box of $P$ in the plate after it has been positioned
7 Subdivide $F_i$, update set $L$
8 foreach free rectangle $F \in L$ do
9 Compute $F \setminus B$, subdivide it into at most four new rectangles $F_1$, [...]
14 Set $L \leftarrow L \setminus \{F_j\}$
15 end
16 end
17 return the size of the rectangle
18 end procedure

The Maxrects-BL algorithm combines the maximal rectangles data structure with the BL algorithm to position rectangular pieces on the plate [51]. A rectangular layout based on the Maxrects-BL algorithm is shown in Fig 3, which follows the work of Jukka et al. [52]. The borders of the maximal rectangles in the free area are drawn in blue, orange, and red, respectively.
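The positioning strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the strip height is capped by a safe upper bound so that it can be treated as a finite plate, and the bottom-left rule is taken to mean "lowest top edge first, then leftmost", which is one common reading of Maxrects-BL. The function names `split`, `prune`, and `maxrects_bl` are hypothetical.

```python
def split(f, b):
    """Subdivide free rectangle f around placed box b into at most four rectangles."""
    fx, fy, fw, fh = f
    bx, by, bw, bh = b
    if bx >= fx + fw or bx + bw <= fx or by >= fy + fh or by + bh <= fy:
        return [f]                                    # no intersection: keep f as is
    out = []
    if bx > fx:                                       # part of f left of b
        out.append((fx, fy, bx - fx, fh))
    if bx + bw < fx + fw:                             # part of f right of b
        out.append((bx + bw, fy, fx + fw - (bx + bw), fh))
    if by > fy:                                       # part of f below b
        out.append((fx, fy, fw, by - fy))
    if by + bh < fy + fh:                             # part of f above b
        out.append((fx, by + bh, fw, fy + fh - (by + bh)))
    return out


def contains(a, b):
    """True if rectangle a fully contains rectangle b."""
    return (a[0] <= b[0] and a[1] <= b[1]
            and a[0] + a[2] >= b[0] + b[2] and a[1] + a[3] >= b[1] + b[3])


def prune(rects):
    """Drop duplicates and every rectangle contained in another (keep maximal ones)."""
    rects = list(dict.fromkeys(rects))
    return [r for r in rects if not any(o != r and contains(o, r) for o in rects)]


def maxrects_bl(pieces, W):
    """Pack pieces (w, h) in the given order; return (placements, packing height)."""
    cap = sum(max(w, h) for w, h in pieces)           # assumed upper bound on strip height
    free = [(0.0, 0.0, float(W), float(cap))]         # maximal free rectangles (x, y, w, h)
    placed = []
    for w, h in pieces:
        best = None
        for fx, fy, fw, fh in free:
            for pw, ph in ((w, h), (h, w)):           # 90-degree rotation is allowed
                if pw <= fw and ph <= fh:
                    key = (fy + ph, fx)               # lowest top edge, then leftmost
                    if best is None or key < best[0]:
                        best = (key, (fx, fy, pw, ph))
        if best is None:
            raise ValueError(f"piece {(w, h)} does not fit in width {W}")
        box = best[1]
        placed.append(box)
        free = prune([r for f in free for r in split(f, box)])
    height = max((y + ph for _, y, _, ph in placed), default=0.0)
    return placed, height


print(maxrects_bl([(2, 2), (2, 2), (4, 1)], W=4))
```

The returned height plays the role of $H$, so a sequence-optimization method can call `maxrects_bl` on a candidate order and use $W \cdot H$ as its cost.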

Sequence optimization based on the DRL algorithm
In this section, a DRL algorithm is described to solve the sequence optimization problem for rectangular packing, and the positioning algorithm described above is used as a placement strategy for rectangular pieces. Furthermore, the policy network based on the encoder-decoder structure and RL training are described.

Network architecture
In the studies of Vinyals O et al. [31] and Bello I et al. [32], a network architecture called the pointer network (PtrNet) was proposed to solve some classical combinatorial optimization problems, such as the TSP and the knapsack problem. The architecture is similar to the sequence-to-sequence (seq2seq) model [30,53], with two main differences. On the one hand, the original seq2seq model uses a fixed representation of the input sequence, and the output dictionary is also fixed, while in PtrNet the output dictionary size is variable. On the other hand, in sequence models based on the attention mechanism, attention is used to blend the hidden units of the encoder into a context vector, which still does not handle problems in which the size of the output dictionary depends on the input. In PtrNet, by contrast, attention is used as a pointer to select members of the input sequence. In this paper, the pointer network is used as the policy network, and its neural network architecture is shown in Fig 4. The pointer network contains two recurrent neural network (RNN) modules, the encoder and the decoder, both of which consist of long short-term memory (LSTM) units [54]. The encoder reads the input sequence one piece per time step and transforms it into a sequence of latent memory states $\{enc_i\}_{i=1}^{n}$, where $enc_i \in \mathbb{R}^d$. At each step, the latent memory state is fed into the LSTM cell as the input, and the cell output is collected. When the end of the input sequence is reached, a special symbol ($\Rightarrow$) is input into the model. The cell state and the output of the final encoder step are provided to the decoder network. In the decoder, the network switches to a generation mode and also maintains its latent memory states $\{dec_i\}_{i=1}^{n}$, where $dec_i \in \mathbb{R}^d$. Furthermore, at each time step, one output of the encoder network is chosen as the input to the next decoder step.
For example, as shown in Fig 4, the output of step 3 of the decoder network is 2, and the input of the corresponding encoder is $(w_2, h_2)$, so the output $(w_2, h_2)$ of step 4 of the encoder network is pointed to (selected) as the input of step 4 of the decoder network. In the decoder network, the attention mechanism and the glimpse mechanism [32] are used to integrate the output information of the decoder unit and the output information of the encoder network to predict the next packed piece. When the end of the output sequence is reached, a special symbol ($\Leftarrow$) is encountered, which indicates the termination of the output sequence.
For the rectangular packing problem, the input to the network is a sequence of pieces to be packed, which contains the size data (width and height) of the pieces. At the $i$-th time step, the $d$-dimensional embedding of the size data $(w_i, h_i)$, obtained by a linear transformation shared by all input steps, is input to the encoder. The output of the network is another sequence that represents the order in which the pieces are packed. Define a set of piece data pairs $(P, C_p)$, where $P = \{P_1, P_2, \ldots, P_n\}$ is the set of pieces to be packed on a single plate, which is also a sequence of $n$ vectors. $C_p = \{C_1, C_2, \ldots, C_{m(P)}\}$ is the optimal sequence of the piece packing, which consists of $m(P)$ indices. Each index is between 1 and $n$, and $m(P)$ is the length of the target sequence.
An RNN with parameters $\theta$ is used as the base network, and the conditional probability $p(C_p \mid P; \theta)$ is estimated according to the chain rule. This is the probability of selecting the optimal packing sequence $C_p$ under the guidance of the neural network parameterized by $\theta$, as given in Formula (3). In addition, $p(C_i \mid C_1, C_2, \ldots, C_{i-1}, P)$ is modeled by a pointer network with an attention mechanism, as shown in Formula (4) and Formula (5). The parameters of the model are updated by maximizing the conditional probability over the training set, as given in Formula (6).
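Formulas (3) to (6) are referenced above but do not appear in this text. As a hedged reconstruction, assuming they follow the standard pointer-network formulation of Vinyals O et al. [31] and using the symbols defined in this section, they would read roughly as follows:

```latex
\begin{align}
p(C_p \mid P; \theta) &= \prod_{i=1}^{m(P)} p(C_i \mid C_1, \ldots, C_{i-1}, P; \theta) \tag{3} \\
u^{i}_{j} &= v^{T} \tanh(W_1 e_j + W_2 d_i), \quad j \in \{1, \ldots, n\} \tag{4} \\
p(C_i \mid C_1, \ldots, C_{i-1}, P) &= \mathrm{softmax}(u^{i}) \tag{5} \\
\theta^{*} &= \arg\max_{\theta} \sum_{(P,\, C_p)} \log p(C_p \mid P; \theta) \tag{6}
\end{align}
```

Here $e_j$ and $d_i$ denote the encoder and decoder hidden states, and $v$, $W_1$, and $W_2$ are learnable parameters. The exact form used by the authors may differ in detail.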

Among them, the vector $u^i$ is normalized by the softmax function into an output distribution over the dictionary of inputs. $v$, $W_1$, and $W_2$ are the learnable parameters of the output model, and $v^T$ is the transpose of $v$. $u^i_j$ is used as a pointer to the $j$-th input element. The hidden states of the encoder and decoder are $(e_1, e_2, \ldots, e_n)$ and $(d_1, d_2, \ldots, d_{m(P)})$, respectively. It is worth emphasizing that a new attention vector is generated at each time step of the decoder. It is calculated by performing a softmax operation over all elements along $j$, so that the input element with the strongest correlation can be identified. In our experiments, the hidden dimensionality of the encoder and the decoder are both set to 128, a commonly used value. Therefore, $v$ is a vector, and $W_1$ and $W_2$ are square matrices.
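The pointer mechanism described above can be illustrated with a small pure-Python sketch. It assumes the additive attention form $u^i_j = v^T \tanh(W_1 e_j + W_2 d_i)$ followed by a softmax over $j$, as in the standard pointer network. Masking pieces that are already packed is an additional assumption borrowed from common practice and is not stated in this text. The function names and the tiny dimension $d = 4$ are illustrative only.

```python
import math
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def pointer_distribution(enc, dec_i, W1, W2, v, visited=frozenset()):
    """Softmax over u_j = v^T tanh(W1 e_j + W2 d_i), with visited inputs masked out."""
    if len(visited) >= len(enc):
        raise ValueError("at least one input must remain unvisited")
    w2d = matvec(W2, dec_i)
    scores = []
    for j, e_j in enumerate(enc):
        if j in visited:
            scores.append(-math.inf)                  # assumed masking of packed pieces
            continue
        w1e = matvec(W1, e_j)
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, w1e, w2d)))
    top = max(scores)                                 # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


rng = random.Random(0)
d, n = 4, 3                                           # toy sizes; the paper uses d = 128
rand_vec = lambda: [rng.uniform(-1, 1) for _ in range(d)]
enc = [rand_vec() for _ in range(n)]                  # stand-ins for encoder states e_j
dec_i = rand_vec()                                    # stand-in for decoder state d_i
W1 = [rand_vec() for _ in range(d)]
W2 = [rand_vec() for _ in range(d)]
v = rand_vec()

probs = pointer_distribution(enc, dec_i, W1, W2, v, visited={1})
greedy = max(range(n), key=probs.__getitem__)         # greedy choice used at test time
print(probs, greedy)
```

During training, the next piece would be sampled from `probs`, while the greedy choice corresponds to selecting the highest-probability piece at each step.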

RL training
In the study of Vinyals O et al. [31], supervised learning is used to train the pointer network. This requires a large number of high-quality labels, which are difficult to obtain in practical applications. In contrast, model-free RL lets an agent interact with the environment to perform autonomous learning and decision optimization [55], and it has received widespread attention. In this section, an RL-based strategy is used to train the neural network. As described above, the input of the pointer network is the sequence information $P = \{(w_i, h_i)\}_{i=1}^{n}$, where $w_i$ and $h_i$ represent the width and height of the $i$-th rectangular piece, respectively. $p$ denotes the sequence information of an individual sample. The output of the network is the packing sequence, denoted by $o$. $A(o \mid p)$ represents the plate area after rectangular packing and is also used to evaluate the packing sequence. The goal of rectangular packing is then to obtain the minimum packing area $A(o \mid p)$ under the guidance of the sequence $o$, combined with the positioning strategy based on Maxrects-BL. Because the width $W$ of the plate is constant, this is equivalent to obtaining the minimum packing height $H_{min}$ and thus the maximum packing utilization rate. The stochastic policy of a neural network parameterized by $\theta$ is defined as $\pi(o \mid p, \theta)$, which represents the probability that the sequence $o$ is selected for packing, given the piece information $p$. The goal of training is to assign high probability to sequences $o$ that yield a small packing area. In short, the goal of training is to minimize the expected packing area, which is defined in Formula (7).
In the training process, the piece sets $p$ are drawn from a distribution $P$, and the overall training objective includes sampling from this distribution, i.e., $J(\theta) = \mathbb{E}_{p \sim P}\, J(\theta \mid p)$. Among the policy-based RL algorithms, the REINFORCE algorithm proposed by Williams [56] is a general class of associative RL algorithms. It is a policy gradient algorithm with MC-based updates that adjusts the weights along the direction of the expected reinforcement gradient. In this paper, the REINFORCE algorithm is used for parameter training. The basic idea is that the parameters are updated after each episode (a complete traversal of a sample sequence of pieces to be packed) is completed. After the rewards and the predicted probability distribution are obtained, the parameter $\theta$ of the neural network is updated by a corresponding increment. The parameter optimization based on the policy gradient is shown in Formula (8).
$$\nabla_\theta J(\theta \mid p) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid p)} \left[ A(o \mid p)\, \nabla_\theta \log \pi_\theta(o \mid p) \right] \tag{8}$$

The REINFORCE algorithm is a policy gradient method based on MC sampling, which adopts an episode-based update. That is, the reward value is returned after each episode is completed. Furthermore, the value function is usually replaced by the average reward in MC-based methods, as described in the related research [37]. In the training stage, for each sample $p$, the output of the neural network is randomly sampled according to the probability distribution given by the policy network. Let $N$ be the number of complete trajectories sampled for each gradient update, resulting in $N$ i.i.d. samples $p_1, p_2, \ldots, p_N \sim L$, where $L$ is the training set. The gradient in Formula (8) can then be approximated with MC sampling, as shown in Formula (9).
In the above formula, the packing area $A(o_i \mid p_i)$ of the plate is obtained by combining the packing sequence $o_i$ with the positioning strategy based on Maxrects-BL. In the process of parameter optimization, adaptive moment estimation (Adam) [57], an improved stochastic gradient descent method, is used to update the network parameters. In the test stage, the greedy strategy is applied. That is, the prediction with the highest probability is selected as the output at each time step. The process of network training based on RL is summarized in Algorithm III. Furthermore, the flow of the rectangular piece strip packing algorithm based on deep reinforcement learning (SP-DRL) is presented in Fig 5.

Algorithm Ⅲ: RL training
1 procedure (scale of training data $M$, training set $L$, training steps $T$, batch size $B$)
2 Initialize pointer network parameters $\theta$, training step $t = 1$
3 for $t = 1$ to $T$ do
4 Select a batch of samples $p_i$ for $i \in \{1, 2, \ldots, B\}$
5 Send $p_i$ to the pointer network, sample packing sequence $o_i$ based on $\pi_\theta(\cdot \mid p_i)$ for $i \in \{1, 2, \ldots, B\}$
6 Obtain packing area $A(o_i \mid p_i)$ by combining the Maxrects-BL positioning strategy
7 $g_\theta \leftarrow \frac{1}{B} \sum_{i=1}^{B} A(o_i \mid p_i)\, \nabla_\theta \log \pi_\theta(o_i \mid p_i)$
8 $\theta \leftarrow \mathrm{ADAM}(\theta, g_\theta)$
9 end for
10 return $\theta$
11 end procedure
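The training loop above can be illustrated with a deliberately small, dependency-free Python sketch. It is not the authors' implementation, and three simplifications are assumed: the pointer network is replaced by one learnable logit per piece (a sequential softmax policy over the remaining pieces), the Maxrects-BL cost is replaced by a toy next-fit shelf packer, and Adam is replaced by a plain gradient step. All function names are hypothetical.

```python
import math
import random


def sample_order(theta, rng):
    """Sample a permutation o ~ pi_theta; return (o, gradient of log pi_theta(o))."""
    remaining = list(range(len(theta)))
    order, grad = [], [0.0] * len(theta)
    while remaining:
        top = max(theta[k] for k in remaining)
        weights = [math.exp(theta[k] - top) for k in remaining]
        total = sum(weights)
        pick = rng.choices(remaining, weights=weights)[0]
        for k, wk in zip(remaining, weights):
            grad[k] -= wk / total                     # minus the softmax probability
        grad[pick] += 1.0                             # plus the one-hot chosen index
        order.append(pick)
        remaining.remove(pick)
    return order, grad


def shelf_area(pieces, order, W):
    """Toy stand-in for A(o|p): area W * H of a next-fit shelf packing."""
    x = shelf_h = height = 0.0
    for k in order:
        w, h = pieces[k]
        if x + w > W:                                 # piece does not fit: open a new shelf
            height += shelf_h
            x, shelf_h = 0.0, 0.0
        x += w
        shelf_h = max(shelf_h, h)
    return W * (height + shelf_h)


def reinforce_step(theta, pieces, W, rng, batch=32, lr=0.05):
    """One update: g = (1/B) * sum_i A(o_i|p) * grad log pi(o_i|p), then descend."""
    g = [0.0] * len(theta)
    for _ in range(batch):
        order, grad = sample_order(theta, rng)
        area = shelf_area(pieces, order, W)
        for k in range(len(theta)):
            g[k] += area * grad[k] / batch
    return [t - lr * gk for t, gk in zip(theta, g)]   # plain step instead of Adam


rng = random.Random(0)
pieces = [(2, 3), (2, 1), (2, 3), (2, 1)]
theta = [0.0] * len(pieces)
for _ in range(50):
    theta = reinforce_step(theta, pieces, W=4, rng=rng)
greedy = sorted(range(len(theta)), key=lambda k: -theta[k])  # greedy decoding
print(greedy, shelf_area(pieces, greedy, W=4))
```

The estimator here uses no baseline, matching Formula (8) as written; subtracting an average-reward baseline is a common variance-reduction choice that this sketch leaves out.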

Computational experiments
The project is implemented in Python, and the neural network is built with PyTorch [58]. Computational tests of the SP-DRL are performed on a machine with a 2.30 GHz AMD Ryzen 7 3750H CPU with 4 cores and 16 GB of RAM. The test instances are selected from the stochastic training set and from rectangular packing problem instances, also called benchmark instances in other literature, provided by the EURO Special Interest Group on Cutting and Packing (ESICUP, https://www.euro-online.org/websites/esicup/data-sets/). There is currently no large-scale training dataset for rectangular pieces, and the rectangular pieces in practical applications can be obtained by scaling or expanding the size of general rectangular pieces. Therefore, in this paper, the dataset is generated automatically. The specific method of generating data is as follows. A complete rectangular piece is defined, which can also be used as a rectangular plate. Its size is set as $(W_I, H_I)$, the number of pieces to be generated is $n$, and the set $I = \{(W_I, H_I)\}$ is initialized. Subsequently, a piece is randomly selected from set $I$, an edge is randomly selected from the edge set of the piece, and the coordinate of a cutting point is randomly obtained so that the piece is divided into two pieces at the cutting point. The minimum side length of a cut piece is required to be $H_{Imin}$, and the maximum side length is $H_{Imax}$. Then, the cut pieces are added to set $I$, while the original piece is deleted from set $I$. The same operation is performed for all $M$ data samples, and the piece training set $L$ is obtained.
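The generation procedure above can be sketched as follows. The text does not say how the side-length bounds are enforced or when cutting stops, so this sketch makes its own assumptions: every cut keeps both new sides at least `side_min`, pieces with a side longer than `side_max` are preferred for cutting, and cutting stops once `n` pieces exist or no piece can be cut further. The function name `generate_pieces` and its parameter names are illustrative.

```python
import random


def generate_pieces(W_I, H_I, n, side_min, side_max, rng):
    """Cut one (W_I, H_I) rectangle into at most n pieces by random guillotine cuts."""
    pieces = [(W_I, H_I)]
    while len(pieces) < n:
        # A piece can be cut only if a side is long enough to yield two valid parts.
        cuttable = [p for p in pieces if max(p) >= 2 * side_min]
        if not cuttable:
            break
        # Assumption: prefer pieces that still violate the maximum side length.
        oversized = [p for p in cuttable if max(p) > side_max]
        w, h = rng.choice(oversized or cuttable)
        axes = [a for a, side in (("w", w), ("h", h)) if side >= 2 * side_min]
        axis = rng.choice(axes)                       # randomly selected edge
        side = w if axis == "w" else h
        cut = rng.uniform(side_min, side - side_min)  # random cutting point
        pieces.remove((w, h))                         # delete the original piece
        if axis == "w":
            pieces += [(cut, h), (w - cut, h)]
        else:
            pieces += [(w, cut), (w, h - cut)]
    return pieces


rng = random.Random(0)
sample = generate_pieces(1.0, 2.0, n=20, side_min=0.2, side_max=0.6, rng=rng)
print(len(sample), sum(w * h for w, h in sample))
```

A training set of $M$ samples could then be built by calling `generate_pieces` $M$ times. With this stopping rule, some pieces may still exceed `side_max` when `n` is small, which may differ from the authors' generator.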
In the experiment, the training set is obtained through automatic generation, and the test set is composed of test instances. Each piece $(W_I, H_I)$ can be divided into at most 100 small rectangular pieces, where $W_I$ is set to 1, $H_I$ is set to 2, $H_{Imin}$ is set to 0.2, and $H_{Imax}$ is set to 0.6. The number of samples $M$ is set to 1600, the batch size $B$ is set to 32, the number of training steps $T$ is set to 50, and the hidden dimension of the LSTM cells is set to 128. Adam is used as the model optimizer, the initial learning rate is set to 1e-3, the weight decay is set to 1e-5, and the discount factor of the reward is set to 1. The loss curve for model training is shown in Fig 6. Table 1 lists 4 test instances based on random piece sizes and 18 instances taken from studies of classical heuristic algorithms (SA+BLF [59], GA+BLF [59], SRA [60,61]), together with the results of the Maxrects-BL and SP-DRL algorithms on these instances. A smaller packing height means a higher utilization rate of the plate, which corresponds to a better packing scheme. The benchmark-based test instances in Table 1 are scaled down to match the piece sizes used during training, and the reduced plate size retains two significant figures to better reflect the generalization performance of the neural network model. The results are then scaled back up by the same ratios so that they can be compared with the results in the relevant literature.
By analyzing Table 1, it can be seen that the DRL-based packing model has good generalization performance, and the rectangular packing of various instances can be realized in a very short time. Within the same instance, the results of the different algorithms follow the same trend and differ only slightly. Across instances, the results of the different algorithms fluctuate considerably, which indicates that the algorithms perform differently on different instances. Comparing the results of SP-DRL and Maxrects-BL shows that SP-DRL obtains a smaller packing height, and hence a higher packing utilization rate, in all instances except C12, which illustrates the effectiveness of DRL-based sequence optimization. Comparing the results of SP-DRL with those of the three classical heuristic algorithms (SA+BLF, GA+BLF, SRA) shows that the proposed SP-DRL algorithm produces three better results (C21, C23, C33) and five comparable results (C13, C22, C31, C41, C53) with a height difference of 1, which are highlighted and underlined, respectively. Specifically, the packing height calculated by the SP-DRL algorithm on instance C21 is 16, which is consistent with the result of the three classical heuristic algorithms. However, the packing heights calculated by the SP-DRL algorithm on instances C23 and C33 are 15 and 31.63, respectively, which are better than the results of SA+BLF and the other algorithms. The packing utilization rate obtained by the SP-DRL algorithm on instance C23 is 100%, which reaches the level of an optimal solution. As the instance number increases, the number of pieces to be packed increases, and so does the computational complexity. The packing height calculated by the SP-DRL algorithm on instance C13 is 21, which is 1 more than that calculated by the SA+BLF, GA+BLF, and SRA algorithms. In addition, in instance C51, the SP-DRL algorithm obtains a smaller packing height than the SA+BLF and GA+BLF algorithms.
It is worth emphasizing that the SP-DRL algorithm requires no more than 1 second of calculation time in each of the 18 benchmark-based instances, which indicates high solution efficiency. The comparative experiments show that the SP-DRL algorithm proposed in this paper not only has good generalization ability and obtains better solutions in multiple standard instances but also has clear advantages in solution efficiency and good application potential. The partial layout of the optimal solution based on the SP-DRL algorithm is displayed in Fig 7.
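The relationship between packing height and utilization rate used in this comparison can be made concrete with a small sketch. The function name `utilization_rate` and the toy instance are hypothetical and are not taken from Table 1. The utilization rate is assumed here to be the total piece area divided by the strip area up to the packing height.

```python
def utilization_rate(pieces, plate_width, packing_height):
    """Fraction of the strip area (plate_width x packing_height) covered by pieces."""
    if plate_width <= 0 or packing_height <= 0:
        raise ValueError("plate_width and packing_height must be positive")
    piece_area = sum(w * h for w, h in pieces)
    return piece_area / (plate_width * packing_height)


if __name__ == "__main__":
    # Toy instance (illustrative only): three pieces on a plate of width 4.
    toy_pieces = [(2, 2), (2, 2), (4, 1)]
    for height in (3, 4):
        print(height, utilization_rate(toy_pieces, 4, height))
```

For a fixed set of pieces and a fixed plate width, the numerator is constant, so a smaller packing height always gives a higher utilization rate, and a rate of 100% means that the strip is filled without waste.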

Discussion and analysis
It can be seen from the experimental results that the SP-DRL algorithm can efficiently complete the packing of rectangular pieces of different sizes, which is related to the good generalization performance of the DRL model. The model can automatically learn the packing rules by exploiting the underlying pattern shared by piece packing and the update of packing parameters, which gives it strong universality. In the RL-based model training process, the reward is returned only after all pieces to be packed have been traversed, which coincides with the reward-return strategy of MC, so the packing parameters can be better learned and updated. Furthermore, the Adam optimizer calculates an adaptive learning rate for each parameter from estimates of the first and second moments of the gradient, which can help the training converge to better performance. Part of the curve in Fig 6 suddenly shows an upward trend, which may be caused by the adjustment of the learning rate. The SP-DRL algorithm obtains better or comparable results in some instances. However, as the piece size and problem scale increase, the packing effect is affected to a certain extent, which is related to the training and parameter adjustment of the deep network. The calculation time of the SP-DRL algorithm is no more than 1 second in each of the 18 benchmark instances, which reduces the time cost of packing and shows the good generalization performance of piece packing based on a deep network. Further, the experimental results show that the SP-DRL algorithm has great application potential for large-scale rectangular packing problems in which time cost is the main concern. Since there is no large-scale training set for rectangular pieces, the method of automatically generating pieces is used in this paper to generate approximately 160000 small rectangular pieces for training from multiple complete large rectangular pieces.
Therefore, the size of the test set is limited by the size of the sample. In Table 1, the pieces of different scales and sizes in the C instances are scaled proportionally to obtain results that are better suited to the model performance. However, this operation also introduces a deviation in piece size, and the deviation may grow as the number of pieces increases, which affects the packing result. To examine this, both the standard-sized and the reduced-sized pieces of the C instances are packed with the Maxrects-BL algorithm. The results for some instances are given in Table 2, and the partial layout of the experimental results is shown in Fig 8. The data in Table 2 show that scaling the size of the rectangular pieces affects the packing effect to a certain extent. Further, if a large-scale training set based on real piece sizes were available, the DRL model could be expected to generalize better and to obtain a better packing effect. In addition, the deep network has many parameters, and their settings directly affect the DRL-based packing effect. If the pieces in a new packing task are not represented in training, the effectiveness of the network is limited. Due to the limitation of the pointer network on input setting, the algorithm proposed in this paper cannot be directly applied to the solution of 2D irregular piece packing, so its contribution to 2D piece packing is limited.
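The reward-return strategy discussed in this section, in which the reward arrives only after all pieces have been traversed and the discount factor is 1, can be sketched as follows. The function names are hypothetical, and the use of the negative packing height as the terminal reward is an assumption made for illustration rather than the reward definition of the paper.

```python
def episode_rewards(num_pieces, packing_height):
    """Per-step rewards for one packing episode.

    Assumption: intermediate steps give no reward, and the final step returns
    the negative packing height, so that a lower height means a higher reward.
    """
    return [0.0] * (num_pieces - 1) + [-packing_height]


def monte_carlo_returns(rewards, gamma=1.0):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed from the episode end."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    rewards = episode_rewards(num_pieces=4, packing_height=16.0)
    print(monte_carlo_returns(rewards, gamma=1.0))
    print(monte_carlo_returns(rewards, gamma=0.5))
```

With a discount factor of 1, every step of the sequence receives the same return, so each selection in the packing sequence is credited with the final packing result. A smaller discount factor would weight early selections less.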

Conclusions and future work
In this paper, a deep reinforcement learning algorithm for the rectangular strip packing problem is proposed for the first time, which is also the first study of a deep neural network for the 2D piece packing problem. Specifically, a pointer network with an encoder and decoder structure is used as the basic network, and a model-free reinforcement learning algorithm is designed to train the network parameters so that a better packing sequence can be obtained. In the process of piece packing, a positioning strategy based on the Maxrects-BL algorithm is used to determine the placement position of the pieces on the plate and to calculate the model reward and packing parameters, which further improves the packing effect. On the one hand, the SP-DRL algorithm packs rectangular pieces of different scales and sizes without the need to design rules separately for different instances, which yields a more efficient packing method. On the other hand, the SP-DRL algorithm obtains better solutions on multiple instances, which shows strong potential for practical applications. The work in this paper is an early attempt to apply DRL to the 2D packing problem. It not only inherits and advances existing packing research findings but also provides a new idea for the online solution of packing problems and a new reference for solving other combinatorial optimization problems. Moreover, the research in this paper expands the application field of artificial intelligence and has far-reaching theoretical and practical significance. The training data are generated randomly, an approach inspired by the literature [62]. In future work, a real dataset based on the actual production of enterprises will be built, and the proposed algorithm will be tested on this dataset.
In addition, we will study the integration of the sequencing strategy and the positioning strategy of the packing problem into the neural network architecture, to solve large-scale packing and production problems in a more coordinated and efficient manner. Further, building on this research on deep networks, we will take packing algorithms based on knowledge transfer and transfer learning as a research direction to better solve the 2D piece packing problem in heavy industry production.